Results 1 - 20 of 25
1.
Nat Biotechnol; 2024 Mar 21.
Article in English | MEDLINE | ID: mdl-38514799

ABSTRACT

Spatially resolved gene expression profiling provides insight into tissue organization and cell-cell crosstalk; however, sequencing-based spatial transcriptomics (ST) lacks single-cell resolution. Current ST analysis methods require single-cell RNA sequencing data as a reference for rigorous interpretation of cell states, mostly do not use associated histology images, and cannot infer shared neighborhoods across multiple tissues. Here we present Starfysh, a computational toolbox built on a deep generative model that incorporates archetypal analysis and any known cell type markers to characterize known or novel tissue-specific cell states without a single-cell reference. Starfysh improves the characterization of spatial dynamics in complex tissues using histology images and enables the comparison of niches as spatial hubs across tissues. Integrative analysis of primary estrogen receptor (ER)-positive breast cancer, triple-negative breast cancer (TNBC), and metaplastic breast cancer (MBC) tissues identified spatial hubs with patient- and disease-specific cell type compositions and revealed metabolic reprogramming shaping immunosuppressive hubs in aggressive MBC.
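
As a rough illustration of one ingredient of this approach (scoring spots against known cell-type marker genes to anchor cell states without a single-cell reference), here is a minimal Python sketch. The marker lists, matrix sizes, and data layout are hypothetical stand-ins, not Starfysh's actual API:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    genes = [f"g{i}" for i in range(100)]
    expr = pd.DataFrame(rng.poisson(1.0, size=(50, 100)), columns=genes)  # 50 spots

    # Hypothetical marker sets for two cell states (stand-ins, not Starfysh's lists).
    markers = {"Tcell": ["g0", "g1", "g2"], "Tumor": ["g10", "g11", "g12"]}

    # Library-size normalize and log-transform the counts.
    norm = np.log1p(expr.div(expr.sum(axis=1), axis=0) * 1e4)

    # Anchor score: mean z-scored expression of each state's markers per spot.
    z = (norm - norm.mean()) / (norm.std() + 1e-8)
    scores = pd.DataFrame({k: z[v].mean(axis=1) for k, v in markers.items()})

    # Spots with the highest score for a state can serve as anchors for that state.
    anchors = {k: scores[k].nlargest(5).index.tolist() for k in markers}

In Starfysh itself these anchor spots inform the prior of the deep generative model rather than serving as a final assignment.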

2.
bioRxiv; 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-38014231

ABSTRACT

Single-cell genomics has the potential to map cell states and their dynamics in an unbiased way in response to perturbations such as disease. However, elucidating cell-state transitions from healthy to disease requires analyzing data from perturbed samples jointly with unperturbed reference samples. Existing methods for integrating and jointly visualizing single-cell datasets from distinct contexts tend to remove key biological differences or fail to correctly harmonize shared mechanisms. We present Decipher, a model that combines variational autoencoders with deep exponential families to reconstruct derailed trajectories (https://github.com/azizilab/decipher). Decipher jointly represents normal and perturbed single-cell RNA-seq datasets, revealing shared and disrupted dynamics. It further introduces a novel approach to visualizing data without relying on methods such as UMAP or t-SNE. We demonstrate Decipher on bone marrow specimens from patients with acute myeloid leukemia, showing that it successfully characterizes the divergence from normal hematopoiesis and identifies transcriptional programs that become disrupted in each patient upon acquisition of NPM1 driver mutations.
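
To make the visualization idea concrete (representing the data directly in a low-dimensional latent space rather than post-processing with UMAP or t-SNE), here is a toy two-dimensional-latent variational autoencoder in PyTorch. It is a sketch under simplifying assumptions (Gaussian likelihood, a single latent level) and does not reproduce Decipher's deep-exponential-family architecture:

    import torch
    import torch.nn as nn

    class TinyVAE(nn.Module):
        # A 2-D latent space can be plotted directly, avoiding UMAP/t-SNE.
        def __init__(self, n_genes, d_latent=2, d_hidden=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(n_genes, d_hidden), nn.ReLU())
            self.mu = nn.Linear(d_hidden, d_latent)
            self.logvar = nn.Linear(d_hidden, d_latent)
            self.dec = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, n_genes))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            return self.dec(z), mu, logvar

    def loss_fn(x, recon, mu, logvar):
        rec = ((x - recon) ** 2).sum(dim=1).mean()                # Gaussian recon
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
        return rec + kl

    # Toy usage: embed "normal" and "perturbed" cells jointly in one latent space.
    x = torch.randn(256, 50)          # stand-in for log-normalized expression
    model = TinyVAE(n_genes=50)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        recon, mu, logvar = model(x)
        loss = loss_fn(x, recon, mu, logvar)
        opt.zero_grad()
        loss.backward()
        opt.step()

Plotting the learned mu coordinates for normal and perturbed cells on the same axes gives the kind of joint view the abstract describes.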

3.
Sci Adv; 8(42): eade6585, 2022 Oct 21.
Article in English | MEDLINE | ID: mdl-36260667

ABSTRACT

Statistical and machine learning methods help social scientists and other researchers make causal inferences from texts.

4.
J Biomed Inform; 134: 104204, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36108816

ABSTRACT

Confounding remains one of the major challenges to causal inference with observational data. This problem is paramount in medicine, where we would like to answer causal questions from large observational datasets like electronic health records (EHRs) and administrative claims. Modern medical data typically contain tens of thousands of covariates. Such a large set carries hope that many of the confounders are directly measured, and further hope that others are indirectly measured through their correlation with measured covariates. How can we exploit these large sets of covariates for causal inference? To help answer this question, this paper examines the performance of the large-scale propensity score (LSPS) approach on causal analysis of medical data. We demonstrate that LSPS may adjust for indirectly measured confounders by including tens of thousands of covariates that may be correlated with them. We present conditions under which LSPS removes bias due to indirectly measured confounders, and we show that LSPS may avoid bias when inadvertently adjusting for variables (like colliders) that otherwise can induce bias. We demonstrate the performance of LSPS with both simulated medical data and real medical data.
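
A minimal sketch of the large-scale propensity score idea: fit a regularized propensity model over thousands of covariates, then stratify on the estimated score before comparing outcomes. The data-generating details below are invented for illustration; actual LSPS implementations (e.g., in the OHDSI tool stack) differ:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n, p = 5000, 2000                         # many covariates, as in LSPS
    X = rng.binomial(1, 0.05, size=(n, p)).astype(float)
    treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, :20].sum(axis=1) + 1)))

    # L1-regularized propensity model over all covariates (LSPS-style in spirit).
    ps_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    ps_model.fit(X, treat)
    ps = ps_model.predict_proba(X)[:, 1]

    # Stratify on the estimated propensity score before comparing outcomes.
    strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))

The point of including all covariates, rather than a hand-picked confounder list, is that indirectly measured confounders may be captured through correlated measured covariates.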


Subjects
Confounding Factors (Epidemiology); Bias; Causality; Propensity Score
5.
Biostatistics; 23(2): 643-665, 2022 Apr 13.
Article in English | MEDLINE | ID: mdl-33417699

ABSTRACT

Personalized cancer treatments based on the molecular profile of a patient's tumor are an emerging and exciting class of treatments in oncology. As genomic tumor profiling becomes more common, targeted treatments for specific molecular alterations are gaining traction. To discover new potential therapeutics that may apply to broad classes of tumors matching some molecular pattern, experimentalists and pharmacologists rely on high-throughput, in vitro screens of many compounds against many different cell lines. We propose a hierarchical Bayesian model of how cancer cell lines respond to drugs in these experiments and develop a method for fitting the model to real-world high-throughput screening data. Through a case study, the model is shown to capture nontrivial associations between molecular features and drug response, such as requiring both wild-type TP53 and overexpression of MDM2 to be sensitive to Nutlin-3(a). In quantitative benchmarks, the model outperforms a standard approach in biology, with approximately 20% lower predictive error on held-out data. When combined with a conditional randomization testing procedure, the model discovers markers of therapeutic response that recapitulate known biology and suggest new avenues for investigation. All code for the article is publicly available at https://github.com/tansey/deep-dose-response.
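
The hierarchical model itself is beyond a few lines, but its per-experiment building block, a dose-response curve fit to viability measurements, can be sketched with a standard Hill curve. All parameter values and data below are illustrative; the paper's Bayesian model shares strength across cell lines and drugs rather than fitting each curve independently:

    import numpy as np
    from scipy.optimize import curve_fit

    def hill(dose, emax, ec50, h):
        # Standard Hill (log-logistic) dose-response curve.
        return emax * dose**h / (ec50**h + dose**h)

    doses = np.logspace(-3, 1, 9)                     # e.g., uM concentrations
    true = hill(doses, emax=0.9, ec50=0.1, h=1.2)
    rng = np.random.default_rng(2)
    response = true + rng.normal(0, 0.05, size=doses.size)   # noisy viability drop

    params, _ = curve_fit(hill, doses, response, p0=[1.0, 0.1, 1.0],
                          bounds=([0, 1e-4, 0.1], [1.5, 10, 5]))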


Subjects
Antineoplastic Agents; Neoplasms; Antineoplastic Agents/pharmacology; Bayes Theorem; Drug Evaluation, Preclinical/methods; Early Detection of Cancer; High-Throughput Screening Assays; Humans; Neoplasms/drug therapy; Neoplasms/genetics
6.
Int Stat Rev; 88(Suppl 1): S91-S113, 2020 Dec.
Article in English | MEDLINE | ID: mdl-35356801

ABSTRACT

Analyzing data from large-scale, multi-experiment studies requires scientists both to analyze each experiment and to assess the results as a whole. In this article, we develop double empirical Bayes testing (DEBT), an empirical Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. DEBT is a two-stage method: in the first stage, it reports which experiments yielded significant outcomes; in the second stage, it hypothesizes which covariates drive the experimental significance. In both of its stages, DEBT builds on Efron (2008), which lays out an elegant empirical Bayes approach to testing. DEBT enhances this framework by learning a series of black-box predictive models to boost power and control the false discovery rate (FDR). In Stage 1, it uses a prior parameterized by a deep neural network to report which experiments yielded significant outcomes. In Stage 2, it uses an empirical Bayes version of the knockoff filter (Candes et al., 2018) to select covariates with significant predictive power for Stage-1 significance. In both simulated and real data, DEBT increases the proportion of discovered significant outcomes and selects more features when signals are weak. In a real study of cancer cell lines, DEBT selects a robust set of biologically plausible genomic drivers of drug sensitivity and resistance in cancer.
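
DEBT's starting point, Efron's two-groups model, can be sketched in a few lines: estimate the marginal density of the test statistics, compare it with a theoretical null, and report a local false discovery rate. The null proportion here is simply assumed, whereas DEBT (following Efron) estimates the empirical null and layers black-box predictive models on top:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    z = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])

    f = stats.gaussian_kde(z)          # marginal density of all z-scores
    f0 = stats.norm(0, 1).pdf          # theoretical null density
    pi0 = 0.9                          # assumed null proportion (could be estimated)

    lfdr = np.clip(pi0 * f0(z) / f(z), 0, 1)   # Efron-style local FDR
    significant = lfdr < 0.2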

7.
Mol Syst Biol; 15(2): e8557, 2019 Feb 22.
Article in English | MEDLINE | ID: mdl-30796088

ABSTRACT

Common approaches to gene signature discovery in single-cell RNA-sequencing (scRNA-seq) depend upon predefined structures like clusters or pseudo-temporal order, require prior normalization, or do not account for the sparsity of single-cell data. We present single-cell hierarchical Poisson factorization (scHPF), a Bayesian factorization method that adapts hierarchical Poisson factorization (Gopalan et al., 2015, Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 326) for de novo discovery of both continuous and discrete expression patterns from scRNA-seq. scHPF does not require prior normalization and captures statistical properties of single-cell data better than other methods in benchmark datasets. Applied to scRNA-seq of the core and margin of a high-grade glioma, scHPF uncovers marked differences in the abundance of glioma subpopulations across tumor regions and regionally associated expression biases within glioma subpopulations. scHPF revealed an expression signature that was spatially biased toward the glioma-infiltrated margins and associated with inferior survival in glioblastoma.
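
As a rough, non-Bayesian stand-in for the factorization at scHPF's core, maximum-likelihood Poisson factorization of a raw count matrix can be run with scikit-learn's KL-divergence NMF (KL NMF is the Poisson maximum-likelihood point estimate). scHPF additionally places hierarchical gamma priors on the factors and fits them with variational inference, which is what lets it handle sparsity without prior normalization:

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(4)
    counts = rng.poisson(0.3, size=(200, 500))   # cells x genes, UMI-like counts

    model = NMF(n_components=10, beta_loss="kullback-leibler", solver="mu",
                init="nndsvda", max_iter=500)
    cell_scores = model.fit_transform(counts)    # cells x factors
    gene_scores = model.components_              # factors x genes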


Subjects
Glioma/genetics; High-Throughput Nucleotide Sequencing/methods; Single-Cell Analysis; Transcriptome/genetics; Bayes Theorem; Gene Expression Regulation, Neoplastic/genetics; Glioma/pathology; Humans; Poisson Distribution
8.
PLoS One; 13(4): e0195024, 2018.
Article in English | MEDLINE | ID: mdl-29630604

ABSTRACT

OBJECTIVE: Hospital readmissions incur substantial costs every year. Many readmissions are avoidable, and excessive readmissions can also harm patients. Accurate prediction of hospital readmission can effectively help reduce readmission risk. However, the complex relationship between readmission and potential risk factors makes readmission prediction a difficult task. The main goal of this paper is to explore deep learning models that distill such complex relationships and make accurate predictions. MATERIALS AND METHODS: We propose CONTENT, a deep model that predicts hospital readmissions by learning interpretable patient representations, capturing both local and global contexts from patient electronic health records (EHRs) through a hybrid Topic Recurrent Neural Network (TopicRNN) model. The experiment was conducted using the EHRs of a real-world congestive heart failure (CHF) cohort of 5,393 patients. RESULTS: The proposed model outperforms state-of-the-art methods in readmission prediction (ROC-AUC 0.6103 ± 0.0130 vs. 0.5998 ± 0.0124 for the second-best method). The derived patient representations were further utilized for patient phenotyping, and the learned phenotypes provide a more precise understanding of readmission risks. DISCUSSION: Embedding both local and global context in the patient representation not only improves prediction performance but also yields interpretable insights into readmission risks for heterogeneous chronic clinical conditions. CONCLUSION: This is the first model of its kind to integrate conventional deep neural networks with probabilistic generative models for highly interpretable deep patient representation learning. Experimental results and case studies demonstrate the improved performance and interpretability of the model.
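
A stripped-down sketch of the recurrent half of such a model: embed each visit as a bag of medical codes and run a GRU over the visit sequence to score readmission risk. This omits CONTENT's topic-model component (the global context), and all names and sizes are hypothetical:

    import torch
    import torch.nn as nn

    class ReadmissionRNN(nn.Module):
        # Simplified stand-in: a GRU over visit embeddings; CONTENT additionally
        # mixes in topic-model features as the global context.
        def __init__(self, n_codes, d_emb=64, d_hidden=128):
            super().__init__()
            self.emb = nn.EmbeddingBag(n_codes, d_emb)   # mean over codes per visit
            self.rnn = nn.GRU(d_emb, d_hidden, batch_first=True)
            self.out = nn.Linear(d_hidden, 1)

        def forward(self, visit_codes):
            # visit_codes: (batch, n_visits, n_codes_per_visit) long tensor
            b, v, c = visit_codes.shape
            emb = self.emb(visit_codes.view(b * v, c)).view(b, v, -1)
            _, h = self.rnn(emb)
            return torch.sigmoid(self.out(h[-1])).squeeze(-1)

    model = ReadmissionRNN(n_codes=1000)
    x = torch.randint(0, 1000, (8, 5, 10))   # 8 patients, 5 visits, 10 codes each
    risk = model(x)                          # per-patient readmission probabilities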


Subjects
Electronic Health Records/statistics & numerical data; Heart Failure/therapy; Models, Statistical; Patient Discharge/standards; Patient Readmission; Humans; Risk Factors
9.
Proc Natl Acad Sci U S A; 115(13): 3308-3313, 2018 Mar 27.
Article in English | MEDLINE | ID: mdl-29531061

ABSTRACT

Assessing scholarly influence is critical for understanding the collective system of scholarship and the history of academic inquiry. Influence is multifaceted, and citations reveal only part of it. Citation counts exhibit preferential attachment and follow a rigid "news cycle" that can miss sustained and indirect forms of influence. Building on dynamic topic models that track distributional shifts in discourse over time, we introduce a variant that incorporates features, such as authorship, affiliation, and publication venue, to assess how these contexts interact with content to shape future scholarship. We perform in-depth analyses on collections of physics research (500,000 abstracts; 102 years) and scholarship generally (JSTOR repository: 2 million full-text articles; 130 years). Our measure of document influence helps predict citations and shows how outcomes, such as winning a Nobel Prize or affiliation with a highly ranked institution, boost influence. Analysis of citations alongside discursive influence reveals that citations tend to credit authors who persist in their fields over time and discount credit for works that are influential over many topics or are "ahead of their time." In this way, our measures provide a way to acknowledge diverse contributions that take longer and travel farther to achieve scholarly appreciation, enabling us to correct citation biases and enhance sensitivity to the full spectrum of scholarly impact.
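
For reference, the dynamic topic models this work builds on let each topic's word distribution drift over time as a Gaussian random walk in natural-parameter space, roughly:

    \beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}\!\left(\beta_{t-1,k}, \sigma^2 I\right),
    \qquad
    p(w \mid \beta_{t,k}) = \frac{\exp(\beta_{t,k,w})}{\sum_{w'} \exp(\beta_{t,k,w'})}

The variant described here augments this state space with document features (authorship, affiliation, venue) and scores a document's influence by how much it shifts subsequent topic distributions; the exact form of that influence term is specified in the paper, not reproduced here.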

10.
Neuroimage; 180(Pt A): 243-252, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-29448074

ABSTRACT

Recent research shows that the covariance structure of functional magnetic resonance imaging (fMRI) data - commonly described as functional connectivity - can change as a function of the participant's cognitive state (for review see Turk-Browne, 2013). Here we present a Bayesian hierarchical matrix factorization model, termed hierarchical topographic factor analysis (HTFA), for efficiently discovering full-brain networks in large multi-subject neuroimaging datasets. HTFA approximates each subject's network by first re-representing each brain image in terms of the activities of a set of localized nodes, and then computing the covariance of the activity time series of these nodes. The number of nodes, along with their locations, sizes, and activities (over time) are learned from the data. Because the number of nodes is typically substantially smaller than the number of fMRI voxels, HTFA can be orders of magnitude more efficient than traditional voxel-based functional connectivity approaches. In one case study, we show that HTFA recovers the known connectivity patterns underlying a collection of synthetic datasets. In a second case study, we illustrate how HTFA may be used to discover dynamic full-brain activity and connectivity patterns in real fMRI data, collected as participants listened to a story. In a third case study, we carried out a similar series of analyses on fMRI data collected as participants viewed an episode of a television show. In these latter case studies, we found that the HTFA-derived activity and connectivity patterns can be used to reliably decode which moments in the story or show the participants were experiencing. Further, we found that these two classes of patterns contained partially non-overlapping information, such that decoders trained on combinations of activity-based and dynamic connectivity-based features performed better than decoders trained on activity or connectivity patterns alone. We replicated this latter result with two additional (previously developed) methods for efficiently characterizing full-brain activity and connectivity patterns.
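
The steps that HTFA's representation enables, projecting images onto spatial nodes and computing node-level connectivity, can be sketched as follows. Here the node centers and widths are fixed at random values purely for illustration; HTFA infers them (and their number) from the data:

    import numpy as np

    rng = np.random.default_rng(5)
    n_voxels, n_time, n_nodes = 1000, 120, 10
    voxel_xyz = rng.uniform(0, 100, size=(n_voxels, 3))

    # Hypothetical node centers/widths; HTFA learns these from the data.
    centers = rng.uniform(0, 100, size=(n_nodes, 3))
    widths = np.full(n_nodes, 15.0)

    # Each node is a radial basis function image over voxels.
    d2 = ((voxel_xyz[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    F = np.exp(-d2 / widths**2)                  # voxels x nodes

    Y = rng.normal(size=(n_voxels, n_time))      # stand-in fMRI data
    # Node activity time series via least squares: Y ~ F @ W.
    W, *_ = np.linalg.lstsq(F, Y, rcond=None)    # nodes x time

    connectivity = np.corrcoef(W)                # node-by-node functional connectivity

Because n_nodes is far smaller than n_voxels, the node-level covariance is orders of magnitude cheaper than voxel-level connectivity, which is the efficiency argument in the abstract.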


Subjects
Brain Mapping/methods; Brain/physiology; Nerve Net/physiology; Factor Analysis, Statistical; Humans; Image Processing, Computer-Assisted; Magnetic Resonance Imaging/methods
11.
Proc Natl Acad Sci U S A; 114(33): 8689-8692, 2017 Aug 15.
Article in English | MEDLINE | ID: mdl-28784795

ABSTRACT

Data science has attracted substantial attention, promising to turn vast amounts of data into useful predictions and insights. In this article, we ask why scientists should care about data science. To answer, we discuss data science from three perspectives: statistical, computational, and human. Although each of the three is a critical component of data science, we argue that the effective combination of all three components is the essence of what data science is about.

12.
Nat Genet; 48(12): 1587-1590, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27819665

ABSTRACT

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem, we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10^12 observed genotypes; for example, 1 million individuals at 1 million SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure among individuals. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample sizes.
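
The core computational trick, updating structure estimates from random subsamples of markers rather than full passes over all SNPs, can be caricatured with stochastic gradient ascent on a simplified admixture likelihood. This toy fixes the population allele frequencies and uses ordinary SGD, whereas TeraStructure runs stochastic variational inference on the full Bayesian model:

    import torch

    torch.manual_seed(0)
    N, M, K = 100, 10000, 3                      # individuals, SNPs, populations
    geno = torch.randint(0, 3, (N, M)).float()   # stand-in genotypes in {0, 1, 2}

    logits = torch.zeros(N, K, requires_grad=True)   # per-individual admixture
    beta = torch.rand(K, M) * 0.8 + 0.1              # fixed stand-in allele freqs

    opt = torch.optim.Adam([logits], lr=0.05)
    for step in range(200):
        snps = torch.randint(0, M, (500,))       # subsample markers each step
        theta = torch.softmax(logits, dim=1)
        p = (theta @ beta[:, snps]).clamp(1e-4, 1 - 1e-4)
        g = geno[:, snps]
        # Binomial(2, p) log-likelihood, dropping the constant binomial term.
        loglik = (g * p.log() + (2 - g) * (1 - p).log()).mean()
        loss = -loglik
        opt.zero_grad()
        loss.backward()
        opt.step()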


Subjects
Algorithms; Computational Biology/methods; Disease/genetics; Genetic Markers/genetics; Genetic Predisposition to Disease; Models, Statistical; Polymorphism, Single Nucleotide/genetics; Bayes Theorem; Genetics, Population; Humans
13.
IEEE Trans Pattern Anal Mach Intell; 37(2): 256-70, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353240

ABSTRACT

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP generalizes the nested Chinese restaurant process (nCRP) to allow each word to follow its own path to a topic node according to a per-document distribution over the paths on a shared tree. This alleviates the rigid, single-path formulation assumed by the nCRP, allowing documents to easily express complex thematic borrowings. We derive a stochastic variational inference algorithm for the model, which enables efficient inference for massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 2.7 million documents from Wikipedia.
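
A packaged nHDP implementation is not readily available, but the flat (non-nested) HDP it generalizes is: gensim's HdpModel gives a feel for nonparametric topic modeling on a bag-of-words corpus. The nHDP additionally arranges topics on a shared tree with per-document distributions over paths. A usage sketch, assuming gensim is installed:

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    docs = [["topic", "model", "text"],
            ["tree", "topic", "hierarchy"],
            ["stochastic", "variational", "inference", "text"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # Flat HDP as an available stand-in for the nested model.
    hdp = HdpModel(corpus, id2word=dictionary)
    print(hdp.print_topics(num_topics=5, num_words=4))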

14.
IEEE Trans Pattern Anal Mach Intell; 37(2): 334-45, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353245

ABSTRACT

Latent feature models are widely used to decompose data into a small number of components. Bayesian nonparametric variants of these models, which use the Indian buffet process (IBP) as a prior over latent features, allow the number of features to be determined from the data. We present a generalization of the IBP, the distance dependent Indian buffet process (dd-IBP), for modeling non-exchangeable data. It relies on distances defined between data points, biasing nearby data to share more features. The choice of distance measure allows for many kinds of dependencies, including temporal and spatial. Further, the original IBP is a special case of the dd-IBP. We develop the dd-IBP and theoretically characterize its feature-sharing properties. We derive a Markov chain Monte Carlo sampler for a linear Gaussian model with a dd-IBP prior and study its performance on real-world non-exchangeable data.
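
Since the original IBP is the special case of the dd-IBP in which all pairwise distances are equal, sampling from the standard IBP prior shows the kind of binary feature matrices at stake. A minimal numpy sampler:

    import numpy as np

    def sample_ibp(n, alpha, rng):
        # Sample a binary feature matrix from the standard Indian buffet process;
        # the dd-IBP recovers this when all pairwise distances are equal.
        Z = np.zeros((n, 0), dtype=int)
        for i in range(n):
            # Existing dishes: customer i+1 takes dish k with prob m_k / (i+1).
            if Z.shape[1] > 0:
                probs = Z[:i].sum(axis=0) / (i + 1)
                Z[i] = rng.random(Z.shape[1]) < probs
            # New dishes: Poisson(alpha / (i+1)) of them.
            k_new = rng.poisson(alpha / (i + 1))
            if k_new > 0:
                new_cols = np.zeros((n, k_new), dtype=int)
                new_cols[i] = 1
                Z = np.hstack([Z, new_cols])
        return Z

    Z = sample_ibp(n=10, alpha=2.0, rng=np.random.default_rng(6))

The dd-IBP modifies the "existing dishes" step so that the sharing probabilities depend on distances between data points rather than on counts alone.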

15.
IEEE Trans Pattern Anal Mach Intell; 37(2): 346-58, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353246

ABSTRACT

Super-resolution methods form high-resolution images from low-resolution images. In this paper, we develop a new Bayesian nonparametric model for super-resolution. Our method uses a beta-Bernoulli process to learn a set of recurring visual patterns, called dictionary elements, from the data. Because it is nonparametric, the number of elements found is also determined from the data. We test the results on both benchmark and natural images, comparing with several other models from the research literature. We perform large-scale human evaluation experiments to assess the visual quality of the results. In a first implementation, we use Gibbs sampling to approximate the posterior. However, this algorithm is not feasible for large-scale data. To circumvent this, we then develop an online variational Bayes (VB) algorithm. This algorithm finds high quality dictionaries in a fraction of the time needed by the Gibbs sampler.
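
The dictionary-learning step can be approximated with a fixed-size, non-Bayesian method from scikit-learn; the beta-Bernoulli construction in the paper additionally infers the number of dictionary elements from the data. A sketch on a stock sample image:

    import numpy as np
    from sklearn.datasets import load_sample_image
    from sklearn.decomposition import MiniBatchDictionaryLearning
    from sklearn.feature_extraction.image import extract_patches_2d

    img = load_sample_image("china.jpg").mean(axis=2) / 255.0   # grayscale
    patches = extract_patches_2d(img, (8, 8), max_patches=2000, random_state=0)
    X = patches.reshape(len(patches), -1)
    X -= X.mean(axis=1, keepdims=True)          # remove per-patch DC component

    # Fixed-size dictionary as a parametric stand-in for the beta-Bernoulli prior.
    dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
    codes = dico.fit_transform(X)               # sparse codes per patch
    D = dico.components_                        # learned dictionary elements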


Subjects
Image Processing, Computer-Assisted/methods; Algorithms; Bayes Theorem; Humans; Statistics, Nonparametric
16.
Proc Natl Acad Sci U S A; 112(26): E3441-50, 2015 Jun 30.
Article in English | MEDLINE | ID: mdl-26071445

ABSTRACT

Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of fit of a statistical model to a specific dataset. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the admixture model fit to four qualitatively different population genetic datasets: the population reference sample (POPRES) European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
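
The PPC recipe itself is model-agnostic: simulate replicated datasets from the fitted model's posterior, recompute a discrepancy statistic on each, and compare with the observed value. A toy single-population version with expected heterozygosity as the statistic (the paper's checks use richer statistics and a full admixture model):

    import numpy as np

    rng = np.random.default_rng(7)
    N, M = 100, 500                              # individuals, SNPs (toy sizes)
    true_freqs = rng.uniform(0.05, 0.95, M)
    geno = rng.binomial(2, true_freqs, size=(N, M))

    # "Fitted model": Beta posterior per SNP under a Beta(1, 1) prior,
    # a stand-in for a fitted admixture model's posterior.
    alt = geno.sum(axis=0)
    post_a, post_b = 1 + alt, 1 + 2 * N - alt

    def het(g):
        # Discrepancy statistic: mean expected heterozygosity across SNPs.
        f = g.mean(axis=0) / 2
        return np.mean(2 * f * (1 - f))

    obs_stat = het(geno)
    rep_stats = []
    for _ in range(500):                         # posterior predictive replicates
        f_rep = rng.beta(post_a, post_b)
        rep_stats.append(het(rng.binomial(2, f_rep, size=(N, M))))
    ppp = np.mean(np.array(rep_stats) >= obs_stat)   # posterior predictive p-value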


Subjects
Models, Theoretical; Bayes Theorem; Genetic Variation; Humans; Linkage Disequilibrium; Uncertainty
17.
J Am Med Inform Assoc; 22(4): 872-80, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25896647

ABSTRACT

BACKGROUND: As adoption of electronic health records continues to increase, there is an opportunity to incorporate clinical documentation, as well as laboratory values and demographics, into risk prediction modeling. OBJECTIVE: The authors develop a risk prediction model for chronic kidney disease (CKD) progression from stage III to stage IV that includes longitudinal data and features drawn from clinical documentation. METHODS: The study cohort consisted of 2908 primary-care clinic patients who had at least three visits prior to January 1, 2013 and developed CKD stage III during their documented history. Development and validation cohorts were randomly selected from this cohort, and the study datasets included longitudinal inpatient and outpatient data from these populations. Time series analysis (Kalman filter) and survival analysis (Cox proportional hazards) were combined to produce a range of risk models. These models were evaluated using concordance, a discriminatory statistic. RESULTS: A risk model incorporating longitudinal data on clinical documentation and laboratory test results (concordance 0.849) predicts progression from stage III CKD to stage IV CKD more accurately than a similar model without laboratory test results (concordance 0.733, P < .001), a model that only considers the most recent laboratory test results (concordance 0.819, P < .031), and a model based on estimated glomerular filtration rate (concordance 0.779, P < .001). CONCLUSIONS: A risk prediction model that takes longitudinal laboratory test results and clinical documentation into consideration can predict CKD progression from stage III to stage IV more accurately than three models that do not take all of these variables into consideration.
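
The time-series half of this approach can be sketched with a one-dimensional local-level Kalman filter over a patient's laboratory series; the filtered trajectory (and its slope) then supplies covariates to a Cox proportional hazards model. The noise variances and lab values below are illustrative only:

    import numpy as np

    def kalman_filter_1d(y, q=1.0, r=25.0):
        # Local-level model: the true lab value follows a random walk (variance q),
        # and each measurement is a noisy observation of it (variance r).
        n = len(y)
        x, p = y[0], r
        xs = np.empty(n)
        for t in range(n):
            p = p + q                    # predict
            k = p / (p + r)              # Kalman gain
            x = x + k * (y[t] - x)       # update with the measurement
            p = (1 - k) * p
            xs[t] = x
        return xs

    egfr = np.array([52.0, 49.0, 55.0, 47.0, 44.0, 46.0, 41.0])
    filtered = kalman_filter_1d(egfr)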


Subjects
Electronic Health Records; Renal Insufficiency, Chronic/physiopathology; Risk Assessment; Aged; Cohort Studies; Disease Progression; Female; Glomerular Filtration Rate; Humans; Longitudinal Studies; Male; Middle Aged; Models, Theoretical; Primary Health Care; Proportional Hazards Models; Survival Analysis; Time
18.
PLoS One; 9(5): e94914, 2014.
Article in English | MEDLINE | ID: mdl-24804795

ABSTRACT

The neural patterns recorded during a neuroscientific experiment reflect complex interactions between many brain regions, each comprising millions of neurons. However, the measurements themselves are typically abstracted from that underlying structure. For example, functional magnetic resonance imaging (fMRI) datasets comprise a time series of three-dimensional images, where each voxel in an image (roughly) reflects the activity of the brain structure(s)-located at the corresponding point in space-at the time the image was collected. FMRI data often exhibit strong spatial correlations, whereby nearby voxels behave similarly over time as the underlying brain structure modulates its activity. Here we develop topographic factor analysis (TFA), a technique that exploits spatial correlations in fMRI data to recover the underlying structure that the images reflect. Specifically, TFA casts each brain image as a weighted sum of spatial functions. The parameters of those spatial functions, which may be learned by applying TFA to an fMRI dataset, reveal the locations and sizes of the brain structures activated while the data were collected, as well as the interactions between those structures.
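
In symbols, TFA's decomposition expresses image n, evaluated at voxel location r, as a weighted sum of K radial basis functions with learned centers mu_k and widths lambda_k (the notation here is schematic, not the paper's exact parameterization):

    y_n(r) \approx \sum_{k=1}^{K} w_{nk}\, \exp\!\left( -\frac{\lVert r - \mu_k \rVert^2}{\lambda_k} \right)

The weights w_nk trace each source's activity across images, and the fitted centers and widths localize the underlying brain structures.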


Subjects
Bayes Theorem; Brain/physiology; Factor Analysis, Statistical; Humans; Magnetic Resonance Imaging; Models, Neurological
19.
Neuroimage; 98: 91-102, 2014 Sep.
Article in English | MEDLINE | ID: mdl-24791745

ABSTRACT

This paper extends earlier work on spatial modeling of fMRI data to the temporal domain, providing a framework for analyzing high temporal resolution brain imaging modalities such as electroencephalography (EEG). The central idea is to decompose brain imaging data into a covariate-dependent superposition of functions defined over continuous time and space (what we refer to as topographic latent sources). The continuous formulation allows us to parametrically model spatiotemporally localized activations. To make group-level inferences, we elaborate the model hierarchically by sharing sources across subjects. We describe a variational algorithm for parameter estimation that scales efficiently to large data sets. Applied to three EEG data sets, we find that the model produces good predictive performance and reproduces a number of classic findings. Our results suggest that topographic latent sources serve as an effective hypothesis space for interpreting spatiotemporal brain imaging data.
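
Schematically, this extends the spatial decomposition shown in the previous entry with a temporal profile per source, so that data at location r and time t are modeled as (again, the notation is illustrative):

    y(r, t) \approx \sum_{k=1}^{K} w_k\, \exp\!\left( -\frac{\lVert r - \mu_k \rVert^2}{\lambda_k} \right) g_k(t)

where g_k(t) is source k's temporal profile; the covariate dependence and hierarchical sharing across subjects enter through the parameters of these sources.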


Subjects
Brain Mapping; Brain/physiology; Electroencephalography; Models, Neurological; Models, Statistical; Adolescent; Adult; Algorithms; Event-Related Potentials, P300; Humans; Time Factors; Young Adult
20.
Proc Natl Acad Sci U S A; 110(36): 14534-9, 2013 Sep 03.
Article in English | MEDLINE | ID: mdl-23950224

ABSTRACT

Detecting overlapping communities is essential to analyzing and exploring natural networks such as social networks, biological networks, and citation networks. However, most existing approaches do not scale to the size of networks that we regularly observe in the real world. In this paper, we develop a scalable approach to community detection that discovers overlapping communities in massive real-world networks. Our approach is based on a Bayesian model of networks that allows nodes to participate in multiple communities, and a corresponding algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities. We demonstrate how we can discover the hidden community structure of several real-world networks, including 3.7 million US patents, 575,000 physics articles from the arXiv preprint server, and 875,000 connected Web pages from the Internet. Furthermore, we demonstrate on large simulated networks that our algorithm accurately discovers the true community structure. This paper opens the door to using sophisticated statistical models to analyze massive networks.
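
The interleaving of subsampling and updating follows the standard stochastic variational inference pattern: compute a noisy estimate of a variational parameter from a subsample of the network, then blend it into the running estimate with a decaying step size,

    \lambda^{(t+1)} = (1 - \rho_t)\,\lambda^{(t)} + \rho_t\,\hat{\lambda}_t,
    \qquad
    \rho_t = (\tau_0 + t)^{-\kappa}, \quad \kappa \in (0.5, 1]

where \hat{\lambda}_t is the estimate implied by the subsampled nodes and links; step sizes of this form satisfy the Robbins-Monro conditions that guarantee convergence.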


Subjects
Algorithms; Bayes Theorem; Community Networks; Models, Statistical; Computer Simulation; Humans; Social Behavior; Stochastic Processes